{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "aa4b55d6",
   "metadata": {},
   "source": [
    "# Preventing a Database Reconstruction Attack with Differentially Private Synthetic Data\n",
    "\n",
    "## About the Attack\n",
    "This notebook is based off of a similar [notebook](https://github.com/Trusted-AI/adversarial-robustness-toolbox/blob/main/notebooks/attack_database_reconstruction.ipynb) from IBM's adversarial robustness toolkit (ART).\n",
    "\n",
    "It uses the [DatabaseReconstruction](https://github.com/Trusted-AI/adversarial-robustness-toolbox/blob/main/art/attacks/inference/reconstruction/white_box.py) inference attack.\n",
    "\n",
    "## Attack Methodology\n",
    "The attacker has access to a pretrained model trained on a full dataset, and all of the \"public\" data except a single specific missing row. This is akin to the following real world scenario: a single user requests that their data be disincluded from a dataset, and in response the data owner deletes the user's data from the public database in order to \"protect\" their information. The attack uses the given pretrained model as an estimator and infers the missing row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a8b460b0",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/lucasrosenblatt/opt/miniconda3/envs/smartnoise-tests/lib/python3.7/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.utils.testing module is  deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.utils. Anything that cannot be imported from sklearn.utils is now part of the private API.\n",
      "  warnings.warn(message, FutureWarning)\n"
     ]
    }
   ],
   "source": [
    "from snsynth.pytorch.nn.patectgan import PATECTGAN\n",
    "from snsynth.quail import QUAILSynthesizer\n",
    "from diffprivlib.models import LogisticRegression as DPLR\n",
    "from snsynth.pytorch.pytorch_synthesizer import PytorchDPSynthesizer\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11097d6a",
   "metadata": {},
   "source": [
    "### Dataset\n",
    "We use the Car Evaluation dataset [https://archive.ics.uci.edu/ml/datasets/car+evaluation] \n",
    "\n",
    "### Creating a Differentially Private synthesizer\n",
    "We create a differentially private synthetic dataset with PATECTGAN and QUAIL here (we will use this dataset to train our model)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "985d505b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Memory consumed by car:96896\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/lucasrosenblatt/opt/miniconda3/envs/smartnoise-tests/lib/python3.7/site-packages/sklearn/utils/validation.py:70: FutureWarning: Pass n_components=10 as keyword args. From version 0.25 passing these as positional arguments will result in an error\n",
      "  FutureWarning)\n",
      "/Users/lucasrosenblatt/opt/miniconda3/envs/smartnoise-tests/lib/python3.7/site-packages/sklearn/utils/validation.py:70: FutureWarning: Pass n_components=10 as keyword args. From version 0.25 passing these as positional arguments will result in an error\n",
      "  FutureWarning)\n",
      "/Users/lucasrosenblatt/opt/miniconda3/envs/smartnoise-tests/lib/python3.7/site-packages/sklearn/utils/validation.py:70: FutureWarning: Pass n_components=10 as keyword args. From version 0.25 passing these as positional arguments will result in an error\n",
      "  FutureWarning)\n",
      "/Users/lucasrosenblatt/opt/miniconda3/envs/smartnoise-tests/lib/python3.7/site-packages/sklearn/utils/validation.py:70: FutureWarning: Pass n_components=10 as keyword args. From version 0.25 passing these as positional arguments will result in an error\n",
      "  FutureWarning)\n",
      "/Users/lucasrosenblatt/opt/miniconda3/envs/smartnoise-tests/lib/python3.7/site-packages/sklearn/utils/validation.py:70: FutureWarning: Pass n_components=10 as keyword args. From version 0.25 passing these as positional arguments will result in an error\n",
      "  FutureWarning)\n",
      "/Users/lucasrosenblatt/opt/miniconda3/envs/smartnoise-tests/lib/python3.7/site-packages/sklearn/utils/validation.py:70: FutureWarning: Pass n_components=10 as keyword args. From version 0.25 passing these as positional arguments will result in an error\n",
      "  FutureWarning)\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import sys\n",
    "module_path = os.path.abspath(os.path.join('..'))\n",
    "if module_path not in sys.path:\n",
    "    sys.path.append(module_path)\n",
    "\n",
    "from load_data import load_data\n",
    "loaded_datasets = load_data(['car'])\n",
    "real_data = loaded_datasets['car']['data']\n",
    "\n",
    "# PATECTGAN generated samples\n",
    "def QuailClassifier(epsilon):\n",
    "            return DPLR(epsilon=epsilon, data_norm=5.02)\n",
    "\n",
    "def QuailSynth(epsilon):\n",
    "    return PytorchDPSynthesizer(epsilon=epsilon, preprocessor=None,\n",
    "                    gan=PATECTGAN(epsilon=epsilon, loss='cross_entropy', batch_size=50, pack=1, sigma=5.0))\n",
    "\n",
    "synth = QUAILSynthesizer(10.0, QuailSynth, QuailClassifier, 'class', eps_split=0.8)                \n",
    "synth.fit(real_data)\n",
    "sample_size = real_data.shape[0]\n",
    "# Note we withhold 4 samples for support samples, detailed below\n",
    "synthetic_data = synth.sample(sample_size-4)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "941442a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Here we add a dummy support row for each class\n",
    "# Occasionally, our synthesizer fails to generate a row of a certain class (usually \"1\" and \"3\")\n",
    "# and so breaks the ART DatabaseReconstructer (which needs matching dimensions for target class)\n",
    "# This is (very) bad practice generally, but ensures that our example does not break\n",
    "synthetic_support_rows = pd.DataFrame(columns=synthetic_data.columns, dtype=np.int64)\n",
    "for i in range(0,2):\n",
    "    synthetic_support_rows = synthetic_support_rows.append({'buying':0,\n",
    "                                                            'lug_boot': 0,\n",
    "                                                            'doors': 0, \n",
    "                                                            'maint': 0, \n",
    "                                                            'persons': 0,\n",
    "                                                            'safety': 0, \n",
    "                                                            'class': 1}, ignore_index=True)\n",
    "    synthetic_support_rows = synthetic_support_rows.append({'buying':0,\n",
    "                                                            'lug_boot': 0,\n",
    "                                                            'doors': 0, \n",
    "                                                            'maint': 0, \n",
    "                                                            'persons': 0,\n",
    "                                                            'safety': 0, \n",
    "                                                            'class': 3}, ignore_index=True)\n",
    "# Add support rows\n",
    "synthetic_data = synthetic_data.append(synthetic_support_rows)\n",
    "# Reset index\n",
    "synthetic_data.reset_index(drop=True, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49935b65",
   "metadata": {},
   "source": [
    "### Train/test splits for both real and synthetic data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b11c975d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score, classification_report, roc_auc_score\n",
    "seed = 42\n",
    "\n",
    "def return_train_splits(df, target, test_size=0.2):\n",
    "    Xy_train, Xy_test = train_test_split(df, test_size=test_size, stratify = df[target], random_state = seed)\n",
    "\n",
    "    X_train = Xy_train.drop(target, axis=1)\n",
    "    y_train = Xy_train[target]\n",
    "\n",
    "    X_test = Xy_test.drop(target, axis=1)\n",
    "    y_test = Xy_test[target]\n",
    "    return X_train, y_train, X_test, y_test \n",
    "\n",
    "X_train, y_train, X_test, y_test = return_train_splits(real_data, 'class')\n",
    "X_train_synth, y_train_synth, X_test_synth, y_test_synth = return_train_splits(synthetic_data, 'class')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0c8fbf4",
   "metadata": {},
   "source": [
    "### Our classifiers (non-private and private)\n",
    "As of right now, it seems that the IBM ART DatabaseReconstruction attack only works on the GaussianNB sklearn implementation, so we use this model for comparison. We train a non-private and private version of this model (the private version of this model is trained on the synthetic dataset we generated earlier)\n",
    "\n",
    "### Results (classifier ML utility)\n",
    "GaussianNB is not the ideal modeling approach for the car dataset, so neither the non-private nor the private models perform particularly well. They do, however, perform comparably (overall accuracy is ~0.64 for non-private versurs ~0.66 for private). We sometimes see this counter-intuitive result when the private synthetic data trained model is able to generalize slightly better than the non-private model.\n",
    "\n",
    "Note that the private model is evaluated as *TSTR* (train on synthetic data, test on real data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "a7fc6e59",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.45      0.23      0.31        77\n",
      "           1       0.00      0.00      0.00        14\n",
      "           2       0.74      0.94      0.83       242\n",
      "           3       0.00      0.00      0.00        13\n",
      "\n",
      "    accuracy                           0.71       346\n",
      "   macro avg       0.30      0.29      0.28       346\n",
      "weighted avg       0.62      0.71      0.65       346\n",
      "\n",
      "0.708092485549133\n",
      "              precision    recall  f1-score   support\n",
      "\n",
      "           0       0.56      0.06      0.12        77\n",
      "           1       0.00      0.00      0.00        14\n",
      "           2       0.84      0.84      0.84       242\n",
      "           3       0.14      1.00      0.25        13\n",
      "\n",
      "    accuracy                           0.64       346\n",
      "   macro avg       0.38      0.48      0.30       346\n",
      "weighted avg       0.71      0.64      0.62       346\n",
      "\n",
      "0.6416184971098265\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/lucasrosenblatt/opt/miniconda3/envs/smartnoise-tests/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n",
      "/Users/lucasrosenblatt/opt/miniconda3/envs/smartnoise-tests/lib/python3.7/site-packages/sklearn/metrics/_classification.py:1221: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
      "  _warn_prf(average, modifier, msg_start, len(result))\n"
     ]
    }
   ],
   "source": [
    "from sklearn.naive_bayes import GaussianNB\n",
    "from art.estimators.classification.scikitlearn import ScikitlearnGaussianNB\n",
    "\n",
    "clf = GaussianNB()\n",
    "clf.fit(X_train, y_train)    \n",
    "\n",
    "art_classifier = ScikitlearnGaussianNB(clf)\n",
    "\n",
    "clf_synth = GaussianNB()\n",
    "clf_synth.fit(X_train_synth, y_train_synth)    \n",
    "\n",
    "art_classifier_synth = ScikitlearnGaussianNB(clf_synth)\n",
    "\n",
    "def evaluate (clf, X_test, y_test):\n",
    "    y_pred = clf.predict(X_test)\n",
    "    print(classification_report(y_test, y_pred))\n",
    "    y_probs = clf.predict_proba(X_test)\n",
    "    print(accuracy_score(y_test, y_pred))\n",
    "\n",
    "evaluate(clf_synth, X_test, y_test)\n",
    "evaluate(clf, X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a85fe84b",
   "metadata": {},
   "source": [
    "### Random target row selection\n",
    "From IBM: \"We now select a row from the training dataset that we will remove. This is the target row which the attack will seek to reconstruct. The attacker will have access to x_public and y_public.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "a2d9f156",
   "metadata": {},
   "outputs": [],
   "source": [
    "target_row = np.random.choice(X_train.index, 1, replace=False)\n",
    "\n",
    "x_public = X_train.drop(target_row)\n",
    "y_public = y_train.drop(target_row)\n",
    "\n",
    "x_public_synth = X_train_synth.drop(target_row)\n",
    "y_public_synth = y_train_synth.drop(target_row)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c46e0b8",
   "metadata": {},
   "source": [
    "### The Database Reconstruction attack\n",
    "Here we create 2 attacks: one with the non-private classifier, and the other with the private (synthetic data trained) classifier. \n",
    "\n",
    "Note that the private classifier is evaluated on attempting two reconstructions - the first, as though the synthetic data was released (this is an odd, unlikely scenario - hard to imagine a user requesting that fake data be disincluded from fake data - but worth looking at as it demonstrates how effective the privatization has been). The second scenario is against the real data (as though many users gave their permission to release data, but we want to protect users who opted out)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "b3cc1b6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from art.attacks.inference.reconstruction import DatabaseReconstruction\n",
    "\n",
    "# Non-private Classifier\n",
    "dbrecon = DatabaseReconstruction(art_classifier)\n",
    "\n",
    "x_pub_numpy = x_public.to_numpy()\n",
    "y_pub_numpy = y_public.to_numpy()\n",
    "x_pub_numpy_synth = x_public_synth.to_numpy()\n",
    "y_pub_numpy_synth = y_public_synth.to_numpy()\n",
    "\n",
    "# Reconstructing the missing row\n",
    "x, y = dbrecon.reconstruct(x_pub_numpy, y_pub_numpy)\n",
    "\n",
    "# Private classifier reconstruction attack\n",
    "dbrecon_synth = DatabaseReconstruction(art_classifier_synth)\n",
    "\n",
    "# Reconstructing the \"missing row\" from the synthetic dataset\n",
    "x_synth_synth, y_synth_synth = dbrecon_synth.reconstruct(x_pub_numpy_synth, y_pub_numpy_synth)\n",
    "\n",
    "# Reconstructing the missing row from the real dataset\n",
    "x_synth_real, y_synth_real = dbrecon_synth.reconstruct(x_pub_numpy, y_pub_numpy)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e728d5f0",
   "metadata": {},
   "source": [
    "### Results\n",
    "The reconstruction attack is **highly** successful when the non-private classifier is released along with the real data - it is essentially able to reconstruct the missing row perfectly. With differential privacy, the attack is completely thwarted. Not only does the example row pass the sniff test, but the RMSE is orders of magnitude higher than without privatization (a good thing in this case!)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "39c9a307",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original row:[[3 1 0 1 1 0]]\n",
      "Reconstructed row (no DP):[[ 3.00000009  0.99999955 -0.00000016  0.99999999  1.         -0.00000001]]\n",
      "Reconstructed row (WITH DP) (synthetic data):[[1.99999304 1.99999983 0.00000016 1.0000024  0.00000018 1.00000959]]\n",
      "Reconstructed row (WITH DP) (real data):[[-0.83311108 -0.83311108  0.89986564 -0.73276206  0.1946248   0.52915717]]\n",
      "\n",
      "Inference RMSE (without differential privacy): 1.9742889242304893e-07\n",
      "Inference RMSE (WITH differential privacy) (synthetic data): 0.8164998879540726\n",
      "Inference RMSE (WITH differential privacy) (real data): 1.9490979010415337\n"
     ]
    }
   ],
   "source": [
    "np.set_printoptions(suppress=True)\n",
    "X_train_np = X_train_synth.to_numpy()\n",
    "original = X_train.loc[target_row].to_numpy()\n",
    "print('Original row:' + str(original))\n",
    "print('Reconstructed row (no DP):' + str(x))\n",
    "print('Reconstructed row (WITH DP) (synthetic data):' + str(x_synth_synth))\n",
    "print('Reconstructed row (WITH DP) (real data):' + str(x_synth_real))\n",
    "print()\n",
    "print(\"Inference RMSE (without differential privacy): {}\".format(\n",
    "    np.sqrt(((original - x) ** 2).sum() / X_train.shape[1])))\n",
    "\n",
    "print(\"Inference RMSE (WITH differential privacy) (synthetic data): {}\".format(\n",
    "    np.sqrt(((original - x_synth_synth) ** 2).sum() / X_train.shape[1])))\n",
    "\n",
    "print(\"Inference RMSE (WITH differential privacy) (real data): {}\".format(\n",
    "    np.sqrt(((original - x_synth_real) ** 2).sum() / X_train.shape[1])))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
